Towards Low Carbon Similarity Search with Compressed Sketches

Authors

  • Arnoldo José Müller Molina
  • Takeshi Shinohara
Abstract

Sketches are compact bit-string representations of objects. Objects that have the same sketch are stored in the same database bucket. The Hamming distance between two sketches gives an estimate of the similarity of the objects they represent: objects that are close to each other are expected to have sketches separated by a small Hamming distance. This estimate is used to schedule the order in which buckets are visited at search time. Recent research has shown that sketches can effectively approximate L1 and L2 distances in high-dimensional settings. A remaining task is to provide a general sketch for arbitrary metric spaces. This paper presents a novel sketch, based on generalized hyperplane partitioning, that can be employed on arbitrary metric spaces. The core of the sketch is a heuristic that tries to generate balanced partitions. The indexing method AESA stores all the distances among database objects, which allows it to perform a small number of distance computations. Experimental evaluations show that our algorithm performs up to one order of magnitude fewer distance operations than AESA in string spaces. Comparisons against other methods show greater gains. Furthermore, we experimentally demonstrate that the physical size of the sketches can be reduced by a factor of ten with different run-length encodings.
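The ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every function name below is hypothetical, the pivot pairs are picked at random (the paper's balancing heuristic is not described in the abstract), and the run-length encoder is a plain textbook one rather than any of the specific encodings the paper evaluates. Each sketch bit records on which side of a generalized hyperplane, defined by a pair of pivots, the object falls. Buckets are then visited in order of Hamming distance to the query's sketch.

```python
import random
from collections import defaultdict
from itertools import groupby


def edit_distance(a, b):
    """Levenshtein distance, used here as an example metric on strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def make_sketch(obj, pivot_pairs, dist):
    """One bit per pivot pair (p, q): 1 if obj is at least as close to p as to q."""
    bits = 0
    for p, q in pivot_pairs:
        bits = (bits << 1) | (dist(obj, p) <= dist(obj, q))
    return bits


def hamming(a, b):
    """Number of bit positions in which two integer sketches differ."""
    return bin(a ^ b).count("1")


def build_index(objects, pivot_pairs, dist):
    """Group objects into buckets keyed by their sketch."""
    buckets = defaultdict(list)
    for obj in objects:
        buckets[make_sketch(obj, pivot_pairs, dist)].append(obj)
    return buckets


def search(query, buckets, pivot_pairs, dist, max_buckets):
    """Visit up to max_buckets buckets, nearest sketch first; return (distance, object)."""
    query_sketch = make_sketch(query, pivot_pairs, dist)
    order = sorted(buckets, key=lambda s: hamming(query_sketch, s))
    best = None
    for sketch in order[:max_buckets]:
        for obj in buckets[sketch]:
            d = dist(query, obj)
            if best is None or d < best[0]:
                best = (d, obj)
    return best


def run_length_encode(bits):
    """Plain run-length encoding of a bit string into (symbol, run length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(bits)]


# Toy data; random pivot selection is an assumption made for illustration only.
words = ["sketch", "sketches", "search", "metric", "matrix", "bucket", "packet"]
rng = random.Random(0)
pivot_pairs = [tuple(rng.sample(words, 2)) for _ in range(4)]
index = build_index(words, pivot_pairs, edit_distance)

# Visiting every bucket makes the search exact; a smaller max_buckets trades
# accuracy for fewer distance computations.
print(search("sketchy", index, pivot_pairs, edit_distance, max_buckets=len(index)))
print(run_length_encode("0001100"))
```

Lowering `max_buckets` is where the Hamming-distance ordering matters: only the buckets whose sketches are closest to the query's sketch are scanned, so fewer calls to the (possibly expensive) metric are made, at the risk of missing the true nearest neighbor.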

Similar resources

Beyond "project and sign" for cosine estimation with binary codes

Many nearest neighbor search algorithms rely on encoding real vectors into binary vectors. The most common strategy projects the vectors onto random directions and takes the sign to produce so-called sketches. This paper discusses the sub-optimality of this choice, and proposes a better encoding strategy based on the quantization and reconstruction points of view. Our second contribution is a n...

Streaming Binary Sketching based on Subspace Tracking and Diagonal Uniformization

In this paper, we address the problem of learning compact similarity-preserving embeddings for massive high-dimensional streams of data in order to perform efficient similarity search. We present a new method for computing binary compressed representations (sketches) of high-dimensional real feature vectors. Given an expected code length c and high-dimensional input data points, our algorithm pro...

Fast image search on a VQ compressed image database

A fast and efficient image search method is developed for a compressed image database using vector quantization (VQ). An image search on an image database requires an exhaustive sequential scan of all the images, given the similarity measure. If compressed images are dealt with, images are decompressed as an initial operation and then the previously mentioned exhaustive search is performed usin...

Texture Evolution in Low Carbon Steel Fabricated by Multi-directional Forging of the Martensite Starting Structure

It has been clarified that deformation and annealing of martensite starting structure can produce ultrafine-grained structure in low carbon steel.  This study aims to investigate the texture evolution and mechanical properties of samples with martensite structure deformed by two different forging processes. The martensitic steel samples were forged by plane strain compression and multi-directio...

Efficient Compression Technique for Sparse Sets

Recent technological advancements have led to the generation of huge amounts of data over the web, such as text, image, audio and video. Needless to say, most of this data is high dimensional and sparse, consider, for instance, the bag-of-words representation used for representing text. Often, an efficient search for similar data points needs to be performed in many applications like clustering, n...


Publication date: 2009